Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems

Author

  • Lu Li
Abstract

CPU/GPU heterogeneous systems have shown remarkable advantages in performance and energy consumption compared to homogeneous ones such as standard multi-core systems. Such heterogeneity represents one of the most promising trends for the near-future evolution of high performance computing hardware. However, as a double-edged sword, the heterogeneity also brings significant programming complexities that prevent the easy and efficient use of such heterogeneous systems. In this thesis, we are interested in four fundamental complexities associated with these systems: measurement complexity (the effort required to measure a metric, e.g., energy), CPU-GPU selection complexity, platform complexity, and data management complexity. We explore new low-cost programming abstractions to hide these complexities, and propose new optimization techniques that can be performed under the hood. Regarding measurement complexity: although measuring time is trivial with native library support, measuring energy consumption, especially on systems with GPUs, is complex because of the low-level details involved, such as choosing the right measurement method, handling the trade-off between sampling rate and accuracy, and switching between different measurement metrics. We propose a clean interface, with an implementation, that not only hides the complexity of energy measurement but also unifies different kinds of measurements. The unification bridges the gap between time measurement and energy measurement: if no metric-specific assumptions related to time optimization techniques are made, energy optimization can be performed by blindly reusing time optimization techniques.
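The unified measurement abstraction described above can be illustrated with a small sketch. The names (`Meter`, `TimeMeter`, `measure`) and the interface shape are illustrative assumptions, not the thesis' actual API; the point is that an optimizer written against the metric-agnostic interface works unchanged whether the underlying meter reports seconds or joules.

```cpp
#include <chrono>
#include <functional>

// Hypothetical sketch of a metric-agnostic measurement interface.
// An EnergyMeter (e.g., backed by a power sensor) could implement the
// same interface, so optimization code need not know which metric it reads.
struct Meter {
    virtual void start() = 0;
    virtual void stop() = 0;
    virtual double value() const = 0;   // measured quantity (seconds, joules, ...)
    virtual ~Meter() = default;
};

struct TimeMeter : Meter {
    std::chrono::steady_clock::time_point t0;
    double elapsed = 0.0;
    void start() override { t0 = std::chrono::steady_clock::now(); }
    void stop() override {
        elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
    }
    double value() const override { return elapsed; }
};

// Any tuning or selection logic written against Meter is reusable
// across metrics, which is the unification the abstract refers to.
double measure(Meter& m, const std::function<void()>& work) {
    m.start();
    work();
    m.stop();
    return m.value();
}
```

A time-based autotuner built on `measure` could then be pointed at an energy meter without code changes, provided it makes no time-specific assumptions.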
For the CPU-GPU selection complexity, which relates to efficient utilization of heterogeneous hardware, we propose a new adaptive-sampling-based construction mechanism for selection predictors that adapts to different hardware platforms automatically and shows non-trivial advantages over random sampling. For the platform complexity, we propose a new modular platform modeling language, with an implementation, to formally and systematically describe a computer system, enabling zero-overhead platform information queries for high-level software tool chains and for programmers, as a basis for making software adaptive. For the data management complexity, we propose a new mechanism that enables a unified memory view on heterogeneous systems with separate memory spaces. This mechanism allows programmers to write significantly less code that runs as fast as expert-written code and outperforms the current commercially available solution, Nvidia's Unified Memory. We further propose two data movement optimization techniques, lazy allocation and transfer fusion optimization, both based on adaptively merging messages to reduce data transfer latency. We show that these techniques can be beneficial, and we prove that our greedy fusion algorithm is optimal. Finally, we show that our approaches to handling the different complexities can be combined so that programmers can use them simultaneously. This research has been partly funded by two EU FP7 projects (PEPPHER and EXCESS) and by SeRC.

“lu_phd_thesis_draft_v0.1” — 2018/2/23 — 14:51

Populärvetenskaplig Sammanfattning (Popular Science Summary, translated from Swedish)

We live in a society where science and technology evolve at an ever-increasing pace, and where computers are now everywhere. People have become dependent on computers to carry out their daily work, and also for entertainment and communication.
Modern life can hardly be imagined without computers. In our highly computerized society, productivity and welfare will suffer significantly if computer performance fails. Faster computers can also open new avenues for research in other areas, such as the deep-learning technology that enables self-driving cars. They can also facilitate discoveries in other scientific disciplines: for example, by running larger simulations, more precise experimental data can be generated, which however requires a faster workstation or even a supercomputer. Because computers are so important and we keep moving more tasks onto them, we in science and engineering work hard to further improve computer performance. To reach this goal, software and hardware must cooperate in a better way. In the past, software could automatically profit from faster hardware in each generation and thereby become faster itself. But those good old days are over. Worse still: faster computers also consume considerably more energy, which creates new problems for society. The solution the hardware industry has adopted since around 2005 is the transition to multi- and many-core computer architectures, i.e., parallel, distributed and often heterogeneous computer systems in which ordinary processors (CPUs) are complemented by graphics processors (GPUs) or other forms of programmable hardware accelerators. These systems require complex programming and careful, resource-aware optimization of the program code for performance and energy efficiency. It is a major challenge for software engineers to create fast code for these complex computer architectures that can meet modern society's steadily growing performance demands. Moreover, the rapid pace of hardware development can make already existing software incompatible or inefficient on new hardware generations. In summary, there are four main problems: (1) It is hard to write efficient program code. (2) For existing performance-critical program code, it is hard to guarantee that it can run at all on each new hardware generation. (3) Even if the code itself is portable, it is hard to automatically maintain its level of efficiency on the next hardware generation. (4) We need methods that can optimize not only a program's execution time but also its energy consumption. In this thesis we explore programming abstractions (e.g., for software components) and techniques for heterogeneous computer systems that address these problems. Our methods and frameworks relieve the programmer of several important tasks without negatively affecting the software's performance. (A) One of the approaches automates memory management and optimizes data transfers so that the program executes faster than the hardware vendor's own automated solution. The same approach makes it possible for the programmer to write more compact, more readable code that nevertheless executes as efficiently as expert-handwritten code, thereby increasing programmer productivity. (B) We developed a platform description language that makes it easier to systematically describe complex computer systems with their hardware and system software components, and that can promote portability, optimization and adaptivity of software to the execution platform. (C) We developed a new mechanism for constructing smart predictors that can make program execution adaptive to the execution platform, enable efficient use of the hardware, and show significant improvements over the state-of-the-art solution. (D) We bridge the gap between performance optimization and energy optimization in a way that makes it possible, under certain conditions, to reuse performance optimization techniques to obtain a reduction in the program's energy consumption. Finally, we can use all these methods and frameworks simultaneously by integrating them in a suitable way.
We make our software prototypes publicly available as open source. In that way they can be used (and in fact already have been used), e.g., by other researchers in our field to handle some of the above-mentioned complexities and as building blocks in other research prototypes.

Popular Science Summary

We live in a society where science and technology are evolving faster than ever, and computers are everywhere. People rely on computers to perform their daily jobs and for entertainment. Modern life is hard to imagine without computers. In the heavily computerized society we live in, society's productivity and welfare would be significantly harmed if computers ran slowly. Moreover, faster computers can unlock the true power of research in other fields, such as the deep-learning technology that enables self-driving cars. They can also facilitate discoveries in other scientific areas: e.g., more precise experimental data can be obtained by running larger simulations, which requires a faster workstation or even a supercomputer. Since computers are so important and we keep putting more tasks on them, scientists and engineers are working hard to further improve their performance. To achieve this goal, software and hardware must collaborate. In the old days, software could rely on faster hardware in each generation, thereby automatically running faster. But those good old days are gone, possibly forever. To make things worse, faster computers also bring significantly higher energy consumption. The alternative is to introduce multi-core/many-core designs in our computers, which keep the energy increase scalable and sustainable but require parallel and distributed programming of often heterogeneous systems with GPUs, and careful optimization for performance and energy efficiency.
Producing fast-running software on these complex parallel computers, to meet the insatiable needs of society, is very challenging for software engineers, not to mention that fast-evolving hardware may break, or run very inefficiently, software that has already been produced. In summary, there are four main problems: 1) it is hard to produce fast software; 2) for existing high-performance software, it is hard to guarantee that it can still run on each new hardware generation, which appears frequently; 3) it is hard to automatically maintain its efficiency on each new hardware generation; 4) we need methods that also reduce the energy consumption of software, in addition to making it faster. In this thesis, we explore new programming abstractions (for software components) and techniques to tackle these problems. We remove four important responsibilities (handling of measurement complexity, CPU/GPU selection, platform complexity, and data management) from software engineers without sacrificing software performance. VectorPU enables software engineers to write significantly less code with the same efficiency as expert-written code, resulting in a productivity boost, and allows software to run significantly faster than the current commercially available solution. We design a new platform description language, XPDL, to systematically describe a computer system and protect software from being broken by different machines, and possibly by future computers. We design a new construction mechanism for smart predictors that makes software execution adaptive to different machines, allows efficient hardware utilization, and shows non-trivial advantages over the state-of-the-art solution.
We bridge the gap between performance optimization and energy optimization: if no metric-specific assumptions related to time optimization techniques are made, we can easily reuse performance optimization techniques to reduce energy consumption instead. Finally, we gain all these benefits simultaneously by integrating the approaches in meaningful ways. We make our software framework prototypes available as open source, so these prototypes can help (and already have helped) other researchers to tackle these complexities and to generate new knowledge.
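The transfer fusion optimization mentioned in the abstract merges adjacent data transfers when a fused transfer is cheaper than separate ones. The sketch below uses a common linear latency model, cost = ALPHA + BETA * bytes, under which fusing two neighboring chunks saves one fixed latency ALPHA but pays BETA for each gap byte transferred along. The cost model, names (`Chunk`, `fuseTransfers`), and the left-to-right greedy pass are illustrative assumptions, not the thesis' actual algorithm or API.

```cpp
#include <vector>
#include <algorithm>

// A contiguous region of bytes to transfer between host and device.
struct Chunk { long offset, size; };

// Greedy fusion: sort chunks by offset, then merge a chunk into its left
// neighbor whenever transferring the gap bytes (BETA * gap) is cheaper
// than paying another fixed transfer latency (ALPHA).
std::vector<Chunk> fuseTransfers(std::vector<Chunk> chunks,
                                 double alpha, double beta) {
    std::sort(chunks.begin(), chunks.end(),
              [](const Chunk& a, const Chunk& b){ return a.offset < b.offset; });
    std::vector<Chunk> out;
    for (const Chunk& c : chunks) {
        if (!out.empty()) {
            long gap = c.offset - (out.back().offset + out.back().size);
            if (beta * gap < alpha) {
                // Fusing is cheaper: extend the previous transfer to cover c.
                out.back().size = c.offset + c.size - out.back().offset;
                continue;
            }
        }
        out.push_back(c);
    }
    return out;
}
```

Under this additive model, each adjacent pair can be decided independently, which is why a single greedy pass suffices; for example, with alpha = 5 and beta = 1, chunks at offsets 0 and 12 (10 bytes each) fuse into one 22-byte transfer, while a chunk at offset 1000 stays separate.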

Similar Articles

Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach

There are several different methods for building an efficient steganalysis strategy for digital images. A very powerful method in this area is the rich model, consisting of a large number of diverse sub-models in both the spatial and transform domains. However, extracting the various types of features from an image is very time-consuming in some steps, especially for the training pha...


Accelerating high-order WENO schemes using two heterogeneous GPUs

A double-GPU code is developed to accelerate WENO schemes. The test problem is a compressible viscous flow. The convective terms are discretized using third- to ninth-order WENO schemes, and the viscous terms are discretized by the standard fourth-order central scheme. The code, written in the CUDA programming language, is developed by modifying a single-GPU code. The OpenMP library is used for parall...


Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU

Digital Breast Tomosynthesis (DBT) is a technology that creates three dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is in the order of seconds, we can use Tomosynthesis systems to perform Tomosynthesis-guided Interventional procedures. This research has been designed to study u...


Fast Cellular Automata Implementation on Graphic Processor Unit (GPU) for Salt and Pepper Noise Removal

Noise removal is commonly applied as a pre-processing step before subsequent image processing tasks, due to the occurrence of noise during the acquisition or transmission process. A common problem in imaging systems using CMOS or CCD sensors is the appearance of salt-and-pepper noise. This paper presents a Cellular Automata (CA) framework for noise removal from an image distorted by the salt an...


Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers

In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers, and discuss the optimization problems that need to be considered in selecting among ...



Publication date: 2018